
IEEE Transactions on Computational Biology and Bioinformatics

Institute of Electrical and Electronics Engineers (IEEE)

All preprints, ranked by how well they match the content profile of IEEE Transactions on Computational Biology and Bioinformatics, based on 17 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Genome-AC-GAN: Enhancing Synthetic Genotype Generation through Auxiliary Classification

Ahronoviz, S.; Gronau, I.

2024-02-16 genomics 10.1101/2024.02.14.580420 medRxiv
Top 0.1%
10.0%

In recent years, there have been increasing attempts to develop computational methods for generating synthetic genomic data that mimic real genomic datasets. Artificial genomes (AGs) generated by these methods have emerged as a promising potential solution to privacy concerns raised by public genomic datasets and as a means of providing adequate representation of under-sampled populations. However, existing methods for generating AGs provide very limited capability for faithfully capturing features of different sub-populations within a larger cohort. In this study, we propose a novel method, the Genome Auxiliary Classifier Generative Adversarial Network (Genome-AC-GAN), which generates AGs tailored to specific sub-populations. We conducted experiments to evaluate the performance of the Genome-AC-GAN and compare the AGs it generates with real genomic data as well as with AGs generated by previously published methods. The Genome-AC-GAN outperforms other methods and faithfully models population structure, which is not adequately captured by existing methods. We also demonstrate the use of AGs generated by the Genome-AC-GAN to augment training sets for classifying genomes into populations. These experiments demonstrate the benefits of AGs in enhancing classification accuracy, especially when dealing with under-sampled and closely related populations.
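The auxiliary-classifier idea the abstract describes can be illustrated with a minimal numpy sketch of the per-sample discriminator objective in a standard AC-GAN (Odena et al.), which conditions generation on a class label (here, a sub-population). This is not the authors' code; the function name and shapes are illustrative:

```python
import numpy as np

def acgan_discriminator_loss(src_prob, class_probs, is_real, class_label):
    """Per-sample AC-GAN discriminator loss: an adversarial term
    (real vs. generated source) plus an auxiliary classification term
    (here: which sub-population the genotype belongs to)."""
    # Negative log-likelihood of the correct source.
    ls = -np.log(src_prob if is_real else 1.0 - src_prob)
    # Negative log-likelihood of the correct class label.
    lc = -np.log(class_probs[class_label])
    return ls + lc
```

The generator is trained against the same two terms, so it learns to produce genotypes that both look real and are recognizably drawn from the requested sub-population.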

2
DVPNet: A New XAI-Based Interpretable Genetic Profiling Framework Using Nucleotide Transformer and Probabilistic Circuits

Kusumoto, T.

2026-01-30 bioinformatics 10.64898/2026.01.28.695053 medRxiv
Top 0.1%
9.2%

In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.

3
Deep Generative Models for Discrete Genotype Simulation

Xie, S.; Tribout, T.; Boichard, D.; Hanczar, B.; Chiquet, J.; Barrey, E.

2025-08-12 bioinformatics 10.1101/2025.08.08.669289 medRxiv
Top 0.1%
8.5%

Deep generative models open new avenues for simulating realistic genomic data while preserving privacy and addressing data accessibility constraints. While previous studies have primarily focused on generating gene expression or haplotype data, this study explores generating genotype data in both unconditioned and phenotype-conditioned settings, which is inherently more challenging due to the discrete nature of genotype data. In this work, we developed and evaluated commonly used generative models, including Variational Autoencoders (VAEs), Diffusion Models, and Generative Adversarial Networks (GANs), and proposed adaptations tailored to discrete genotype data. We conducted extensive experiments on large-scale datasets, including all chromosomes from cattle and multiple chromosomes from humans. Model performance was assessed using a well-established set of metrics drawn from both the deep learning and quantitative genetics literature. Our results show that these models can effectively capture genetic patterns and preserve genotype-phenotype associations. Our findings provide a comprehensive comparison of these models and offer practical guidelines for future research in genotype simulation. We have made our code publicly available at https://github.com/SihanXXX/DiscreteGenoGen.
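One common adaptation for the discreteness problem mentioned above is to treat each SNP as a three-way categorical variable (diploid genotypes coded 0/1/2) and discretize the generator's continuous output. A minimal numpy sketch under that assumption (function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def discretize_genotypes(scores):
    # Per-SNP category scores of shape (n_snps, 3), one column per
    # genotype class {0, 1, 2}; arg-max turns them into discrete calls.
    return np.argmax(scores, axis=1)

def allele_freq(genotypes):
    # Alternate-allele frequency of diploid genotypes coded 0/1/2.
    # Comparing real vs. synthetic allele frequencies is a standard
    # quantitative-genetics evaluation metric.
    return genotypes.sum() / (2 * len(genotypes))
```

During training one would typically use a differentiable relaxation (e.g., a softmax or Gumbel-softmax over the three categories) and reserve the hard arg-max for sampling.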

4
E2EGraph: An End-to-end Graph Learning Model for Interpretable Prediction of Pathological Stages in Prostate Cancer

Zhan, W.; Song, C.; Das, S.; Rebbeck, T. R.; Shi, X.

2023-03-12 bioinformatics 10.1101/2023.03.09.531924 medRxiv
Top 0.1%
8.4%

Prostate cancer is one of the deadliest cancers worldwide. Accurate prediction of pathological stages using the expressions and interactions of genes is effective for clinical assessment and treatment. However, identification of interactions using biological procedures is time-consuming and prohibitively expensive. A graph is a powerful representation for the complex interactome of genes, their transcripts, and proteins. Recently, Graph Neural Networks (GNNs) have gained great attention in machine learning due to their capability to capture the graphical interactions among data entities. To leverage GNNs for predicting pathological stages, we developed an end-to-end graph representation and learning model, namely E2EGraph, which automatically generates a graph representation from gene expression data and uses a multi-head graph attention network to learn the strength of interactions among genes and make the prediction. To ensure the reliability of model predictions, we identify critical components of the graph representation and the GNN model to interpret prediction results from multiple perspectives at the gene and patient levels. We evaluated E2EGraph on predicting pathological stages of prostate cancer using The Cancer Genome Atlas (TCGA) data. Our experimental results demonstrate that E2EGraph reaches state-of-the-art prediction performance while being effective in identifying marker genes indicated by interpretability. Our results point to a direction where adaptive graph construction and attention-based GNNs can be leveraged for various prediction tasks and for interpreting model predictions in a variety of data domains, including disease prediction.

5
Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation

Huang, P.; Charton, F.; Schmelzle, J.-N. M.; Darnell, S. S.; Prins, P.; Garrison, E.; Suh, G. E.

2024-09-20 bioinformatics 10.1101/2024.09.18.612131 medRxiv
Top 0.1%
8.3%

Language Models (LM) have been extensively utilized for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for the generation of synthetic DNA data using pangenomes in combination with LM. We introduce three innovative pangenome-based tokenization schemes that enhance DNA sequence generation. Our experimental results demonstrate the superiority of pangenome-based tokenization over classical methods in generating high-utility synthetic DNA sequences, highlighting significant improvements in training efficiency and sequence quality.

6
Improving Hi-C contact matrices using genome graphs

Shen, Y.; Yu, L.; Qiu, Y.; Zhang, T.; Kingsford, C.

2023-11-12 genomics 10.1101/2023.11.08.566275 medRxiv
Top 0.1%
8.2%

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of the Hi-C analysis pipeline involves mapping Hi-C sequencing reads to a linear reference genome. However, the linear reference genome does not incorporate genetic variation information, which can lead to incorrect read alignments, especially when analyzing samples with substantial genomic differences from the reference, such as cancer samples. Using genome graphs as the reference facilitates more accurate mapping of reads; however, new algorithms are required for inferring linear genomes from Hi-C reads mapped onto genome graphs and for constructing the corresponding Hi-C contact matrices, a prerequisite for subsequent steps of Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show the hardness of solving it, and introduce a novel heuristic algorithm specifically tailored to it. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred by our method lead to improved Hi-C contact matrices. These enhanced matrices show a reduction in erroneous patterns caused by structural variations and are more effective in accurately capturing the structures of topologically associated domains.
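The contact matrices at issue are built by binning the genomic positions of mapped read pairs. A minimal single-chromosome sketch of that standard step (illustrative only, not the authors' pipeline):

```python
import numpy as np

def contact_matrix(read_pairs, chrom_len, bin_size):
    """Build a symmetric Hi-C contact matrix by binning the two
    mapped positions of each read pair into fixed-size genomic bins."""
    n_bins = -(-chrom_len // bin_size)  # ceiling division
    m = np.zeros((n_bins, n_bins), dtype=int)
    for pos1, pos2 in read_pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        m[i, j] += 1
        if i != j:
            m[j, i] += 1  # keep the matrix symmetric
    return m
```

Mis-mapped reads shift counts into the wrong bins, which is exactly the error mode the paper's graph-mediated genome inference aims to reduce.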

7
Data-Driven Symbolic Higher-Order Epistasis Discovery with Kolmogorov-Arnold Networks

Patil, O. R.; Shazand, K.; Marteau, B.; Shen, Y.; Wang, M. D.

2025-11-04 genomics 10.1101/2025.10.31.685894 medRxiv
Top 0.1%
8.1%

Many human diseases are polygenic conditions that arise from a complex interplay of interactions between multiple genes at different loci, but most current Genome-Wide Association Studies (GWAS) largely consider only the main additive effects of single nucleotide polymorphisms (SNPs), resulting in a missing-heritability problem for some complex traits. Identifying non-additive interactions, or epistasis, at higher orders could help fill this gap, but it is computationally difficult due to the massive search space involved. Current epistasis detection approaches struggle with non-Cartesian higher-order interactions and lack inherent explainability. We present a novel deep learning (DL) approach, EPIstasis Discovery with Kolmogorov-Arnold Networks (EPIK), a data-driven, modular, and symbolically representable framework. We also introduce a novel approach for higher-order XOR (a non-Cartesian type) interaction detection, utilized in EPIK's XOR detection module. EPIK slightly outperforms other DL approaches on a simulated pure-epistasis interaction benchmark in average F1 score. It outperforms other, general, traditional epistasis detection approaches on simulated mixed epistasis detection datasets and real-world GWAS datasets of Arabidopsis thaliana. Finally, EPIK recovers a known gene interaction between MAPT and WNT3 for Parkinson's disease (PD) while also suggesting a more complex interaction between MAPT, WNT3, and another gene, KANSL1.
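Why XOR interactions evade additive GWAS scans can be seen in a tiny numpy example: a phenotype driven purely by the XOR of two binary SNP indicators has no marginal effect at either locus. This is a hypothetical illustration of the pattern, not the paper's detection module:

```python
import numpy as np

def xor_phenotype(g1, g2):
    # Pure XOR epistasis: the trait appears when exactly one of the
    # two loci carries the variant, so each locus alone is
    # uninformative (zero marginal/additive effect).
    return np.logical_xor(g1, g2).astype(int)
```

On a balanced design g1 = [0,0,1,1], g2 = [0,1,0,1], the phenotype is [0,1,1,0] and its correlation with either single SNP is exactly zero, which is why a single-locus scan misses the signal entirely.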

8
Protein Function Prediction via Contig-Aware Multi-Level Feature Integration

Yang, L.; Du, K.; Lu, Y.; Wang, M.; Zhang, H.; Yang, S.; Lin, Y.; Zhuo, J.; Zhang, D.; Jiang, Y.; Zhang, X.; Li, S.

2025-08-11 bioinformatics 10.1101/2025.08.07.669053 medRxiv
Top 0.1%
8.0%

Proteins play a central role in biological processes, and accurately predicting their functions is crucial for biomedical research. While computational methods have advanced significantly, most approaches rely solely on sequence or structure, neglecting critical inter-protein relationships, such as the topological arrangement of coding sequences (CDSs) within contigs. To address this gap, we propose CAML, a novel deep learning model that integrates intra-protein features, including sequence and predicted structure, with inter-protein features capturing functional linkages among CDSs in contigs. Specifically, CAML employs a Graph Isomorphism Network (GIN) to extract structural features from predicted protein contact graphs and ESM-2 for sequence embeddings. Additionally, it leverages k-mer frequencies and a Bidirectional Long Short-Term Memory (BiLSTM) network to model functional relationships among colocalized CDSs within contigs, capturing operon-like associations. Extensive experiments demonstrate that CAML outperforms state-of-the-art methods in accuracy, precision, recall, and F1-score, achieving improvements of 11.24%, 12.43%, 13.59%, and 13.30%, respectively, over the second-best model. Ablation studies further confirm the critical contribution of CAML's multi-level biological feature integration in enhancing functional annotation accuracy. Our study demonstrates the importance of integrating structural, sequential, and CDS topological features for accurate protein function prediction, providing a robust computational framework for genomics research.

9
gGN: learning to represent graph nodes as low-rank Gaussian distributions

Edera, A. A.; Stegmayer, G.; Milone, D. H.

2022-11-17 bioinformatics 10.1101/2022.11.15.516704 medRxiv
Top 0.1%
6.9%

Unsupervised learning of node representations from knowledge graphs is critical for numerous downstream tasks, ranging from large-scale graph analysis to measuring semantic similarity between nodes. This study presents gGN as a novel representation that defines graph nodes as Gaussian distributions. Unlike existing representations that approximate such distributions using diagonal covariance matrices, our proposal approximates them using low-rank perturbations. We demonstrate that this low-rank approximation is more expressive and better suited to represent complex asymmetric relations between nodes. In addition, we provide a computationally affordable algorithm for learning the low-rank representations in an unsupervised fashion. This learning algorithm uses a novel loss function based on the reverse Kullback-Leibler divergence and two ranking metrics whose joint minimization results in node representations that preserve not only node depths but also local and global asymmetric relationships between nodes. We assessed the representation power of the low-rank approximation with an in-depth systematic empirical study. The results show that our proposal was significantly better than the diagonal approximation for preserving graph structures. Moreover, gGN also outperformed 17 methods on the downstream task of measuring semantic similarity between graph nodes.
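The two ingredients the abstract names, a diagonal-plus-low-rank covariance and an asymmetric KL-based objective, can be sketched in a few lines of numpy. This illustrates the general idea only, not the gGN learning algorithm itself:

```python
import numpy as np

def low_rank_cov(diag, u):
    # Diagonal covariance plus a rank-1 perturbation u u^T: more
    # expressive than a diagonal alone, at only O(d) extra parameters.
    return np.diag(diag) + np.outer(u, u)

def kl_gaussian(mu0, cov0, mu1, cov1):
    # Closed-form KL(N0 || N1); note it is asymmetric in its
    # arguments, which lets it encode asymmetric node relations.
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))
```

In practice a low-rank structure also admits O(d)-cost determinant and inverse updates (matrix determinant lemma, Sherman-Morrison), which is what keeps learning such representations computationally affordable.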

10
Flippable Siamese Differential Neural Network for Differential Graph Inference

Leng, J.; Yu, J.; Wu, L.-Y.

2025-09-04 bioinformatics 10.1101/2025.08.31.673390 medRxiv
Top 0.1%
6.9%

Differential graph inference is a critical analytical technique that enables researchers to accurately identify the variables, and the interactions among them, that change under different conditions. By comparing two conditions, researchers can gain a deeper understanding of the differences between them. Currently, the mainstream methods in differential graph inference are mathematical optimization algorithms, including sparse optimization based on Gaussian graphical models and sparse Bayesian regression. These methods can eliminate many false positives in graphs, but at the cost of heavy reliance on the prior distributions of data or parameters, and they suffer from the curse of dimensionality. To address these challenges, we introduce a new architecture called the Flippable Siamese Differential Neural Network (FSDiffNet). We establish the concept of flippability and the theoretical foundation of flippable neural networks, laying the groundwork for building such networks. This theoretical framework guided the design of the architecture and its components, including the SoftSparse activation function and high-dilation circular-padding diagonal convolution. FSDiffNet uses large-scale pre-training to acquire differential features and perform differential graph inference. In experiments with simulated and real datasets, FSDiffNet outperforms existing state-of-the-art methods on multiple metrics, effectively inferring key differential factors related to conditions such as autism and breast cancer, demonstrating its effectiveness as a solution for differential graph inference.

11
GP-ML-DC: An Ensemble Machine Learning-Based Genomic Prediction Approach with Automated Two-Phase Dimensionality Reduction via Divide-and-Conquer Techniques

Liu, Q.; Ma, H.; Zhang, Z.; Hu, Z.; Wang, X.; Li, R.; Cai, Y.; Jiang, Y.

2024-12-26 bioinformatics 10.1101/2024.12.26.630443 medRxiv
Top 0.1%
6.7%

Traditional machine learning (ML) and deep learning (DL) methods for genomic prediction often face challenges due to the imbalance between the limited number of samples (n) and the large number of single nucleotide polymorphisms (SNPs) (p), where n is much smaller than p. To address this, we propose GP-ML-DC, an innovative genomic predictor that combines traditional ML and DL models with a unique two-phase, parameter-free dimensionality reduction technique. Initially, GP-ML-DC reduces feature dimensionality by characterizing genes as features. Building on big data methodologies, it employs a divide-and-conquer approach to segment gene regions into multiple haplotypes, further decreasing dimensionality. Each haplotype segment is processed by a sub-task based on traditional ML, followed by integration via a neural network that synthesizes the results of all sub-tasks. Our experiments, conducted on four cattle milk-related traits using ten-fold cross-validation and independent testing, show that GP-ML-DC significantly surpasses current state-of-the-art genomic predictors in prediction performance.

12
CLCNet: a contrastive learning and chromosome-aware network for genomic prediction in plants

Huang, J.; Yang, Z.; Yin, M.; Li, C.; Li, J.; Wang, Y.; Huang, L.; He, F.; Liang, C.; Li, M.; Han, R.; Jiang, Y.

2024-12-30 genomics 10.1101/2024.12.29.630569 medRxiv
Top 0.1%
6.7%

Genomic selection (GS) uses genome-wide markers and phenotypes to predict complex traits and breeding values. The effectiveness of GS critically depends on the accuracy of genomic prediction (GP) models. However, traditional GP models frequently encounter difficulties in accurately capturing inter-individual variability and are often confronted with the curse of dimensionality, with features such as SNPs far exceeding sample sizes, severely restricting their predictive performance. To address these challenges, we present CLCNet (Contrastive Learning and Chromosome-aware Network), a novel deep learning framework that integrates multi-task learning with contrastive learning for GP. CLCNet comprises two key components: (i) a contrastive learning module that enhances the model's ability to capture fine-grained, genotype-dependent phenotypic differences among individuals, and (ii) a chromosome-aware module that performs structured feature selection at both the chromosome and genome levels, retaining the most informative SNPs. CLCNet was evaluated across four major crop species, including maize (Zea mays), cotton (Gossypium hirsutum), rapeseed (Brassica napus), and soybean (Glycine max), covering ten agronomically important traits, and was compared with a diverse set of classical linear, machine learning, and deep learning models. Across most traits, CLCNet achieved top prediction performance, with statistically significant improvements in Pearson correlation coefficient (PCC), typically ranging from 0.3% to 6.5% over strong baseline models, together with reduced mean squared error (MSE). Notably, the advantages of CLCNet were pronounced for traits in maize, rapeseed, and soybean, while for cotton traits largely governed by additive genetic effects, its performance remained stable and did not decline. Overall, these results demonstrate that CLCNet provides a robust and effective framework for improving genomic prediction accuracy and holds substantial potential for accelerating genetic gain in plant breeding.

13
A Benchmark of Evo2 Genomic AI Models for Efficient and Practical Deployment

Li, H.; Ji, H.; Zeng, Y.; Lv, W.; Wu, J.; Liu, S.; Lin, C.; Yang, H.; Li, Z.; Chen, Y.; Dong, W.

2025-09-12 genomics 10.1101/2025.09.10.675279 medRxiv
Top 0.1%
6.6%

The rapid advancement of DNA foundation language models has brought about a transformative shift in genomics, allowing the deciphering of intricate patterns and regulatory mechanisms embedded within DNA sequences. The genomic foundation model Evo2 demonstrates remarkable capabilities in decoding DNA functional patterns through cross-species pretraining. However, despite Evo2's great potential for basic genomics research, there is currently no clear, systematic guidance on its application scenarios, performance, and optimization directions in tumor genomics, and its performance dependency on specialized hardware (such as FP8 precision on H800 GPUs) has not been empirically benchmarked. Here, we present a focused validation of Evo2 using two independent cancer genomic datasets (Bladder Urothelial Carcinoma and Ovarian Cancer). We tested downstream tasks of Evo2, including prediction of tumor pathogenic variants and prediction of mutational effects, and compared its performance on A100 and H800 GPUs. The results show the critical importance of FP8 precision, enabling the H800 to achieve 4x faster inference than the A100 with stable accuracy (AUC 0.88-0.95). The 7B-parameter model emerged as the top performer, whereas the 40B model experienced a severe performance drop (AUC down to 0.48) on non-FP8 hardware such as the A100. These findings empirically validate Evo2's hardware requirements and provide practical insights for researchers implementing the model with similar computational resources. Furthermore, our findings provide a framework for the application and optimization of downstream tasks of the DNA language model Evo2 in cancer, and can guide researchers in effectively applying it in genomic studies.

Key Points:
- Hardware precision impact: FP8 precision on H800 GPUs is critical for Evo2's performance, enabling 4x faster inference than the A100 (which lacks FP8 support) while maintaining high accuracy (AUC 0.88-0.95).
- Model scale optimization: the 7B-parameter model outperformed larger variants (e.g., 40B), which suffered severe accuracy drops (AUC as low as 0.48) on non-FP8 hardware, highlighting a balance between efficiency and performance.
- Practical guidelines: we provide a framework for deploying Evo2 in cancer genomics, including hardware recommendations, dataset curation, and downstream task optimization, valuable for researchers with varied computational resources.

14
Fine-Tuning Protein Language Models Enhances the Identification and Interpretation of the Transcription Factors

Hassan, M. T.; Gaffar, S.; Zahid, H.; Lee, S. J.

2025-11-28 bioinformatics 10.1101/2025.11.27.691010 medRxiv
Top 0.1%
6.5%

Transcription factors (TFs) are pivotal regulators of gene expression and play essential roles in diverse cellular activities. The three-dimensional organization of the genome and transcriptional regulation are predominantly orchestrated by TFs. By recruiting the transcriptional machinery to gene enhancers or promoters, TFs can either activate or repress transcription, thereby controlling gene activity and various biological pathways. Accurate identification of TFs is vital for elucidating gene regulatory mechanisms within cells. However, experimental identification remains labor-intensive and time-consuming, highlighting the need for efficient computational approaches. In this study, we present a two-layer predictive framework utilizing protein language models (pLMs) via full fine-tuning and parameter-efficient fine-tuning. The initial layer robustly classifies and identifies transcription factors, while the subsequent layer predicts TFs with a binding preference for methylated DNA (TFPMs). Our approach further incorporates attention weights and protein sequence motifs to enhance interpretability and predictive capability. By leveraging attention mechanisms, we highlight biologically relevant regions of the protein sequences that contribute most strongly to the predictions. Additionally, motif analysis facilitates the identification of conserved sequence patterns that are critical for TF recognition and function. Across both TF and TFPM classification tasks, the inclusion of these features allowed our methods to consistently surpass contemporary models, as demonstrated by independent test results.

Key points:
- Developed a two-layer predictive framework using protein language models (pLMs) with both full fine-tuning and parameter-efficient fine-tuning methods.
- The first layer accurately identifies transcription factors (TFs), and the second layer predicts TFs with a binding preference for methylated DNA (TFPMs).
- Integrated attention weights and protein sequence motifs to enhance model interpretability by highlighting biologically relevant sequence regions and conserved patterns.
- Achieved superior performance compared to state-of-the-art methods, validated by independent testing.

Mir Tanveerul Hassan obtained his M.Tech. in Computer Science from the University of Kashmir, India, in 2020, and later earned his Ph.D. in Electronics and Information Engineering from Jeonbuk National University, Jeonju, South Korea. He is currently a postdoctoral fellow at the Jeonbuk RICE Intelligence Innovation Research Center. His research interests encompass computational biology, bioinformatics, and pattern recognition. Saima Gaffar received her B.Tech. and M.Tech. degrees in Computer Science from the University of Kashmir, Srinagar, India, and her Ph.D. in Electronics and Information Engineering from Jeonbuk National University, South Korea. Her research focuses on bioinformatics, computational biology, deep learning, and image processing. Hamza Zahid received his B.S. degree in Mechatronics Engineering from the University of Engineering and Technology, Peshawar, Pakistan. He is currently pursuing integrated M.S. and Ph.D. degrees in Electronics and Information Engineering at Jeonbuk National University, South Korea. His primary research interests include applications of artificial intelligence in computational drug discovery. Sang Jun Lee received his B.S., M.S., and Ph.D. degrees in Electrical Engineering from POSTECH, South Korea. Following his doctoral studies, he worked as a senior researcher at the Samsung Advanced Institute of Technology (SAIT). He is currently an Associate Professor in the Division of Electronics and Information Engineering at Jeonbuk National University, South Korea. His research interests include image analysis, deep learning, and medical image processing.

15
KMer-Node2Vec: Learning Vector Representations of K-mers from the K-mer Graph

Yu, Z.; Yang, Z.; Lan, Q.; Huang, F.; Cai, Y.

2022-09-01 bioinformatics 10.1101/2022.08.30.505832 medRxiv
Top 0.1%
6.5%

Learning low-dimensional continuous vector representations for short k-mers divided from long DNA sequences is key to DNA sequence modeling and can be utilized in many bioinformatics investigations, such as DNA sequence classification and retrieval. DNA2Vec is the most widely used method for DNA sequence embedding. However, it scales poorly to large datasets due to its extremely long training time for k-mer embedding. In this paper, we propose a novel, efficient graph-based k-mer embedding method, named Kmer-Node2Vec, to address this concern. Our method converts a large DNA corpus into a single k-mer co-occurrence graph and extracts k-mer relations from the graph by random walks to learn fast, high-quality k-mer embeddings. Extensive experiments show that our method is 29 times faster than DNA2Vec for training on a 4GB dataset, and on par with DNA2Vec in task-specific accuracy for sequence retrieval and classification.
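The co-occurrence graph and random-walk steps described above can be sketched in a few lines of Python. This illustrates the general node2vec-style recipe, not the authors' implementation; in the full method the sampled walks would then be fed to a skip-gram model to learn the embeddings:

```python
import random
from collections import defaultdict

def kmer_graph(seq, k):
    """Weighted co-occurrence graph of adjacent k-mers in a sequence:
    an edge a -> b counts how often k-mer b directly follows a."""
    g = defaultdict(lambda: defaultdict(int))
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for a, b in zip(kmers, kmers[1:]):
        g[a][b] += 1
    return g

def random_walk(g, start, length, rng):
    """Weighted random walk over the k-mer graph; the visited node
    sequence serves as a 'sentence' for skip-gram training."""
    walk = [start]
    for _ in range(length):
        nbrs = g.get(walk[-1])
        if not nbrs:
            break  # dead end: no outgoing edges
        nodes, weights = zip(*nbrs.items())
        walk.append(rng.choices(nodes, weights=weights)[0])
    return walk
```

Because the graph is built once per corpus, walk sampling replaces repeated passes over the raw sequences, which is where the speed advantage over DNA2Vec-style training comes from.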

16
RCoxNet: deep learning framework for enhanced cancer survival prediction integrating random walk with restart with mutation and clinical data

Kumari, S.; Gujral, S.; Panda, S.; Gupta, P.; Ahuja, G.; Sengupta, D.

2024-09-18 bioinformatics 10.1101/2024.09.17.613428 medRxiv
Top 0.1%
6.5%

Cancer poses a significant global health challenge, characterized by complex disease progression and disrupted growth regulation. A thorough understanding of cellular and molecular biological mechanisms is essential for developing novel treatments and improving the accuracy of patient survival predictions. While prior studies have leveraged gene expression and clinical data to forecast survival outcomes through machine learning and deep learning approaches, gene mutation data, despite being a widely recognized metric, has rarely been incorporated due to its limited information content, inadequate representation of gene relationships, and sparsity, which negatively affect the robustness, effectiveness, and interpretability of current survival analysis approaches. To overcome the challenge of mutation data sparsity, we propose RCoxNet, a novel deep learning framework that integrates the Random Walk with Restart (RWR) algorithm with a deep Cox Proportional Hazards model. By applying this framework to mutation data from cBioPortal, our model achieved an average concordance index of 0.62 ± 0.05 across four cancer types, outperforming existing deep neural network models. Additionally, we identified clinical features critical for differentiating between predicted high- and low-risk patients, with the relevance of these features partially supported by previous studies.
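The RWR step itself is compact: iterate p <- (1 - r) * W p + r * p0 over a column-normalized gene-network matrix W until convergence, which smooths a sparse mutation profile p0 over the network. A minimal numpy sketch of the propagation (illustrative; not the authors' code):

```python
import numpy as np

def random_walk_with_restart(W, p0, restart=0.5, tol=1e-10):
    """Iterate p <- (1 - r) * W p + r * p0 to its fixed point.
    W: column-normalized adjacency matrix of the gene network.
    p0: restart distribution (e.g., a patient's mutated genes)."""
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```

The converged vector assigns nonzero scores to network neighbors of mutated genes, turning a sparse binary profile into a dense feature vector a survival model can use.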

17
Dynamic Interaction Learning and Multimodal Representation for Drug Response Prediction

Bi, Y.; Hu, Z.; Lyu, G.; Zhou, M.; Zhang, S.

2022-11-25 bioinformatics 10.1101/2022.11.23.517777 medRxiv
Top 0.1%
6.4%

Mining multimodal pharmaceutical data is crucial for in-silico drug candidate screening and discovery. A daunting challenge in integrating multimodal data is enabling dynamic feature modeling that generalizes to real-world applications. Unlike conventional approaches that use simple concatenation with fixed parameters, in this paper we develop a dynamic interaction learning network to adaptively integrate drugs and different reactants in multimodal tasks for robust drug response prediction. The primary objective of dynamic learning falls into two key aspects: at the micro level, we aim to dynamically search for specific relational patterns across the whole reactant range for each drug-reactant pair; at the macro level, drug features can be used to adaptively correlate with different reactants. Extensive experiments demonstrate the validity of our approach in both drug-protein interaction (DPI) and cancer drug response (CDR) tasks. Our approach achieves superior performance on both DPI (AUC = 0.967) and CDR (AUC = 0.932) tasks, outperforming competitive baselines on four real-world drug-outcome datasets. In addition, performance on the challenging blind subsets is remarkably improved: AUC increases from 0.843 to 0.937 on the blind protein set of the DPI task, and Pearson's correlation increases from 0.516 to 0.566 on the blind drug set of the CDR task. A series of case studies highlights the potential generalization and interpretability of dynamic learning in in-silico drug response assessment.

18
DruID: Personalized Drug Recommendations by Integrating Multiple Biomedical Databases for Cancer

Liany, H.; Jeyasekharan, A.; Rajan, V.

2021-04-11 bioinformatics 10.1101/2021.04.11.439315 medRxiv
Top 0.1%
6.4%

Advances in next-generation sequencing technologies have led to personalized genomic profiles in diagnostic panels that inform oncologists of alterations in clinically relevant genes. While targeted therapies for some alterations may be found, an effective therapeutic strategy should consider the multiple, interdependent genetic interactions that affect cancer progression, a task which remains challenging. There are ongoing efforts to profile cancer cells in vitro, both to catalog their genomic information and to study their sensitivity to various drugs. There is a need for tools that can interpret the personalized genomic profile of a patient in light of information from these biological and pre-clinical studies and recommend potentially useful drugs. To address this need, we develop a new algorithmic framework, DruID, that effectively combines drug efficacy predictions from a deep neural network model with information, such as drug sensitivity, drug-drug interactions, and genetic dependencies, from multiple publicly available databases. We empirically evaluate DruID on cancer cell line data for which the efficacy of many drugs has been experimentally determined. We find that DruID outperforms competing approaches and promises to be a useful tool in clinical decision-making.

19
Identifying Drug Sensitivity Subnetworks with NETPHIX

Kim, Y.-A.; Sarto Basso, R.; Wojtowicz, D.; Hochbaum, D. S.; Vandin, F.; Przytycka, T. M.

2019-10-22 bioinformatics 10.1101/543876 medRxiv
Top 0.1%
6.3%

Phenotypic heterogeneity in cancer is often caused by different patterns of genetic alterations. Understanding such phenotype-genotype relationships is fundamental to the advance of personalized medicine. One of the important challenges in the area is to predict drug response at a personalized level. The pathway-centric view of cancer has significantly advanced the understanding of genotype-phenotype relationships. However, most network identification methods in cancer focus on identifying subnetworks that include general cancer drivers or are associated with discrete features such as cancer subtypes, and hence cannot be applied directly to the analysis of continuous features like drug response. On the other hand, existing genome-wide association approaches do not fully utilize the complex properties of the cancer mutational landscape. To address these challenges, we propose a computational method named NETPHIX (NETwork-to-PHenotype mapping leveragIng eXclusivity), which aims to identify mutated subnetworks associated with drug response (or any continuous cancer phenotype). Utilizing properties such as mutual exclusivity and interactions among genes, we formulate the problem as an integer linear program and solve it optimally to obtain a set of genes satisfying the constraints. NETPHIX identified gene modules significantly associated with many drugs, including interesting response modules to MEK1/2 inhibitors in both directions (increased and decreased sensitivity to the drug) that a previous method, which does not utilize network information, failed to identify. The genes in these modules belong to the MAPK/ERK signaling pathway, which is the pathway targeted by the drug.
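The exclusivity-aware scoring idea behind NETPHIX can be illustrated on toy data: a candidate gene module is rewarded for covering patients in proportion to their phenotype (e.g. drug response) and penalized when a patient is mutated in more than one module gene, the mutual-exclusivity violation. Here exhaustive search over gene subsets stands in for the paper's integer linear program, and the tiny mutation matrix, phenotype values, and penalty weight are illustrative assumptions.

```python
from itertools import combinations

def module_score(module, mutations, phenotype, penalty=0.5):
    """Score a gene module against a continuous phenotype."""
    score = 0.0
    for patient, value in enumerate(phenotype):
        hits = sum(mutations[g][patient] for g in module)
        if hits >= 1:
            score += value                         # covered patients count once
            score -= penalty * value * (hits - 1)  # exclusivity-violation penalty
    return score

def best_module(mutations, phenotype, k):
    # Brute-force stand-in for the ILP: try every size-k gene subset.
    genes = list(mutations)
    return max(combinations(genes, k),
               key=lambda m: module_score(m, mutations, phenotype))

# 4 genes x 5 patients; phenotype = normalized drug response per patient.
mutations = {"A": [1, 0, 0, 0, 1],
             "B": [0, 1, 0, 0, 1],   # overlaps gene A in patient 4
             "C": [0, 0, 1, 0, 0],
             "D": [0, 0, 0, 1, 0]}
phenotype = [0.9, 0.8, 0.7, -0.5, 0.6]
print(best_module(mutations, phenotype, 2))  # -> ('A', 'C')
```

The search prefers ('A', 'C') over the higher-coverage but overlapping pair ('A', 'B'), showing how the exclusivity penalty shapes the selected module; the real method additionally constrains modules to be connected in an interaction network.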

20
Predicting 3D Chromatin Interactions Using Transformer-Enhanced Deep Learning Models

Xu, K.; Shen, L.

2025-04-16 bioinformatics 10.1101/2025.04.10.647995 medRxiv
Top 0.1%
6.3%

The three-dimensional (3D) structure of the human genome is essential for regulating gene expression and cellular functions. Chromatin interactions bring distant genomic regions into physical contact, enabling processes such as gene regulation, DNA replication, and repair. Disruptions in this organization can lead to diseases such as cancer and genetic disorders. In this study, we propose a Transformer-based deep learning model to predict chromatin interactions from DNA sequences. By developing a streamlined and efficient data pipeline to handle sparse and noisy high-throughput chromosome conformation capture (Hi-C) sequencing data, our approach improves both data processing speed and model performance. The Transformer's ability to capture long-range interactions among genomic regions via the attention mechanism, combined with nucleotide position encoding, enables more accurate predictions than purely convolution-based models. This work highlights the potential of Transformer-based network architectures to advance our understanding of genome organization and paves the way for future research with larger datasets and advanced network designs.
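One standard way to realize the nucleotide position encoding this abstract mentions is the sinusoidal scheme from the original Transformer, which gives each sequence position a distinct vector that attention can exploit; the paper's exact encoding may differ, and the sequence length and model dimension below are illustrative choices.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: pe[pos][2i] = sin, pe[pos][2i+1] = cos."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each dimension pair (2i, 2i+1) shares a wavelength that grows
            # geometrically with i, so positions get unique signatures.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=8, d_model=4)
print(pe[0])  # -> [0.0, 1.0, 0.0, 1.0]
```

Added to (or concatenated with) nucleotide embeddings, these vectors let the attention layers distinguish genomic positions that a position-agnostic model would treat identically.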